An Alternative Method to Remove Duplicate Tuples
Author
Abstract
The problem of performing database operations on parallel architectures has received much attention in both applied and theoretical research. Much of that attention has focused on performing these operations on distributed-memory architectures such as the hypercube. Algorithms that perform relational database operations on a hypercube typically exploit its interconnectivity not only to process the relational operators efficiently but also to perform dynamic load balancing. Certain relational operators (e.g., projection and union) can produce interim relations that contain duplicate tuples. As a result, an algorithm for a relational database system must address the removal of duplicate tuples from these interim relations. Existing algorithms accomplish this by compacting the relation into hypercubes of successively smaller dimension. We present an alternative method for removing duplicate tuples from a relation distributed over a hypercube, using the embedded ring found in every hypercube. Through theoretical analysis of the algorithm and empirical observation, we demonstrate that using the ring to remove the duplicate tuples is significantly more efficient than using the hypercube.
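The abstract does not spell out the ring algorithm, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes the ring is embedded with a reflected Gray code (consecutive codes differ in one bit, so ring neighbours are hypercube neighbours) and that each node circulates a snapshot of its tuples around the ring, with the copy at the lowest ring position surviving. The names `gray_ring`, `is_hypercube_edge`, and `ring_dedup` are hypothetical, and the message passing is simulated sequentially in one process.

```python
def gray_ring(dim):
    """Node ids of a dim-dimensional hypercube in reflected Gray-code order."""
    return [i ^ (i >> 1) for i in range(1 << dim)]


def is_hypercube_edge(a, b):
    """True when node ids a and b differ in exactly one bit."""
    x = a ^ b
    return x != 0 and x & (x - 1) == 0


def ring_dedup(local, dim):
    """Simulate duplicate removal over the embedded ring.

    local maps node id -> list of tuples held by that node.
    Returns a dict of the same shape in which every distinct tuple
    survives exactly once, at the lowest ring position that held it.
    """
    ring = gray_ring(dim)
    p = len(ring)
    # Step 1: each node removes its own local duplicates (order preserved).
    kept = {node: list(dict.fromkeys(local.get(node, []))) for node in ring}
    # Step 2: each ring position k starts with a snapshot of its own tuples.
    traveling = [(k, frozenset(kept[ring[k]])) for k in range(p)]
    # Step 3: p - 1 shifts; after each shift a position holds its
    # predecessor's previous block, so every block visits every node once.
    for _ in range(p - 1):
        traveling = [traveling[(k - 1) % p] for k in range(p)]
        for k in range(p):
            origin, block = traveling[k]
            if origin < k:  # the earlier ring position keeps its copy
                node = ring[k]
                kept[node] = [t for t in kept[node] if t not in block]
    return kept


if __name__ == "__main__":
    relation = {
        0: [(1, "a"), (2, "b"), (1, "a")],
        1: [(2, "b"), (3, "c")],
        3: [(1, "a"), (3, "c")],
        2: [(4, "d"), (2, "b")],
    }
    print(ring_dedup(relation, dim=2))
```

In this sketch every transfer goes to a ring successor, which by the Gray-code property is always a direct hypercube neighbour. A real implementation would also have to address load balancing and communication cost, which this simulation ignores.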
Similar resources
Semantic Management of Deduplicate Tuples in the Relational Databases
A relational database is a collection of relations. Duplicate tuples are common in many real time relational databases. In a relational database, if the same real-world entity is represented by more than one tuple, then such tuples are called duplicate tuples. Finding duplicate tuples and then replacing them with the single best tuple is called a fusion operation. Whenever duplicate tuples are fou...
Approximate Joins for Relational Data
Krommydas, Ioannis, Evagelos, Georgia. MSc, Computer Science Department, University of Ioannina, Greece. June, 2008. Approximate Joins for Relational Data. Thesis Supervisor: Vassiliadis Panos. Relational databases often contain duplicate data entries. This may occur due to a variety of reasons, such as typographical errors, multiple conventions for recording database fields or other noise sour...
An Alternative Secondary Goal Approach to Modify Cross Efficiency Evaluation in Data Envelopment Analysis
Cross efficiency evaluation is used to measure the performance of decision-making units in data envelopment analysis. One of the most important shortcomings of this method is the existence of alternative optimal solutions, so the efficiency scores are not unique. We summarize the previous models proposed by researchers and suggest an alternative secondary goal approach...
Eliminating Fuzzy Duplicates in Data Warehouses
The duplicate elimination problem of detecting multiple tuples, which describe the same real world entity, is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches ...
Indeterministic Handling of Uncertain Decisions in Duplicate Detection
In current research, duplicate detection is usually considered as a deterministic approach in which tuples are either declared as duplicates or not. However, most often it is not completely clear whether two tuples represent the same real-world entity or not. In deterministic approaches, however, this uncertainty is ignored, which in turn can lead to false decisions. In this paper, we present a...
Journal title:
Volume, issue
Pages -
Publication date: 2007